Detection of hotspots of school dropouts in India: A spatial clustering approach

School dropout is a significant concern universally. This paper investigates the incorporation of spatial dependency in estimating the topographical effect of school dropout rates in India. This study utilizes the secondary data on primary, upper primary, and secondary school dropout rates of the different districts of India available at the Unified District Information System for Education plus (UDISE+) for the year 2020 to contemplate the impact of these dropouts from one region to different regions in molding with promotion rate and repetition rate. The Global Moran’s I, Univariate and Bivariate Local Indicators of Spatial Association, and spatial models are utilized to investigate the geographical variability and to find the possible relationship between dropout rates and the school-level factors at the district level. The outcomes provide clear spatial clustering and precisely highlight the hot zone dropout regions with high repetition and low promotion rates. Based on this study’s results, educational administrators can make evidence-based decisions to reduce dropout rates in hot zones of various regions of India. Furthermore, futuristic studies focusing on linking spatial hot zones with causal factors will add consistent data in assisting policymakers in taking necessary measures to develop a sound education management system.


Introduction
Education is a foundation for human progress toward creating a healthy society. Its effects are significant in the development of individuals and the whole country. India's school education vision 2030 intends to qualitatively improve the nation's current educational system and provide high-quality education to all children of the school-attending age group, whose numbers are estimated to climb from 25 crores in 2010 to 30 crores in 2030. Though the school enrolment rates have increased in the past few years, the percentage of students who drops out of school has either remained the same or increased. As per UDISE+ 2020 report, the dropout rate in the secondary level (17%) is still high compared to the primary (1.8%) and upper primary level (1.5%). This percentage of school dropouts adds a quantitative inclusion burden to India's vision of establishing education goals [1].
The Dissimilarity in the primary education curriculum and school infrastructure has been noted to play a key role in primary education outcomes. As a result, the government executed a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 a new policy named "universalization of elementary education" (UEE) in 2001. UEE focuses on three major elements: universalization, which guarantees that all students between the ages of 6 and 14 have direct exposure to a school; enrolment universalization, which guarantees that all students in the aforesaid age are enrolled in school; and retention universalization, which guarantees that students who started primary school progress till they complete the upper primary level [2,3].
In the continuum of the retention universalization goal, research on factors influencing school dropout generally concentrates on child, family, school and community-related factors [4][5][6][7][8]. Only a few studies incorporated the regional variation in the proportion of dropout rate [9]. The use of geographic information systems in educational research, planning, and policy strategizing is a relatively recent development. However, it is becoming more common as researchers realise the benefits of providing a visual depiction of statistical data [10]. A recent study from India used spatial techniques to identify district-level variation between education programs and literacy rates based on the first major element of UEE. School dropouts remarkably increase quantitative content to the third major element of UEE-retention universalization [11]. Universally, Jose Eos Trinidad (2022) used spatial tools to analyze the determinants of high school dropout with race and poverty in New York City [12]. Mark. J. Schafer (2006) performed school-level spatial analysis to determine the relationship between school-level factors and high school dropout in Louisiana [13]. In India, Our study is the first to examine the regional variability and school-level risk factors of primary, upper primary and secondary dropouts across all the districts of India. District-level promotion rate and repetition rate of boys and girls are the School level risk factors examined in this study for possible impact on dropout. Each district is chosen as the analytical unit and used as a spatial region to highlight the regional variations in dropout and school-level factors in contemplation to carefully explore the relationships adjusting the potential covariates resulting from geographical influences.
To improve the efficiency of the school management framework, this approach incorporating geospatial technology will assist policy-makers and researchers in better understanding the existing status. Additionally, this would support the Sarva Shiksha Abhiyan and National Education Policy in providing direction for formulating preventive measures to decrease dropouts. The spatial representations of education policies across districts can help to guarantee that every student in India completes their schooling at any cost to enhance their quality of life. Policy-makers and government representatives can also utilize findings from this study to improve the current policies, fostering India's school education and assisting in developing, testing, and implementing cost-effective strategies in hotspot regions to reduce dropouts in India.

Data
The UDISE+ of India recently released data for the year 2020, accessible on the UDISE+ website https://dashboard.udiseplus.gov.in. The reports have been published under the Department of School Education and Literacy (DoSEL), Ministry of Education, Government of India. This report is based on data that schools with active UDISE+ codes in a reference year voluntarily uploaded using a data collection format (DCF) specifically created for this report. The State/UT government of the school's location assigns the UDISE+ code for institutions. The District Education Officer (DEO) at the district level ensures that the information entered into the DCFs is accurate. This report offers essential information on several factors, such as the number of schools, teachers, and students who were enrolled, promoted, and dropped out, in terms of counts and percentages. This data source is the input for the analysis [1].

Response and predictors
Following the conceptual frameworks of earlier studies [12][13][14], the district-level overall dropout rates for primary, upper primary, and secondary levels were the outcome variables considered in this study. Utilizing the aggregate district-level data, this study included six independent variables: promotion rate, repetition rate, and a dropout rate of boys and girls.

India digital map
The primary researcher downloaded India's district-level base map from kaggle at https:// www.kaggle.com/datasets/raghulgandhi/indian-district-map-726. Initially, the number of districts reported in UDISE+ was 733. However, a few districts in Arunachal Pradesh, Delhi, Karnataka, Manipur, Tamil Nadu, and West Bengal were merged for analysis purposes using their boundary, and the number of districts considered for the analysis was 726 given in S1 Dataset.

Analysis
To examine the geographic distribution of dropouts in primary, upper primary, and secondary levels in India, several quartile maps at the district-level were created. To investigate the geographical correlation and grouping of districts, Moran's I, LISA cluster map, and significance maps were created. To calculate the distance in space between every potential pair of observable units in the dataset. The Queens' contiguity approach has been used to produce the spatial weight matrix of order one [15]. Since the regional coordinates and attribute data on school dropouts are discontinuous, no clear spatial relationship exists. According to Queens' technique, neighbors are geographical districts with a non-zero length shared border. The Moran's I statistic reveals how identical or distinct observed values are to their geographical neighbors [16]. Therefore, values for the univariate and bivariate LISA were generated to examine the spatial correlation of the dropout rates among districts. The Moran's I statistic can be calculated using the following formula [17]: Where y denotes the dropout and � y denotes the average of y; n denotes the number of districts; W mn denotes the standardised weight matrix connecting observation m and n; and S 0 denotes the total of all geographical weights.
Where y and z represent the dropout and predictors; � Y represents the average of y; � Z represents the average of predictors; n denotes the number of districts; W mn denotes the standardised weight matrix connecting observation m and n; and S 0 denotes the sum of all geographical weights.
A positive spatial autocorrelation suggests that spots with identical data points are strongly connected in the area, while a negative spatial autocorrelation shows that strongly connected spots are more distinct. The values of Moran's I typically range from (-1, 1), with positive measures indicating the geographical grouping of comparable measures and negative measures indicating the spatial grouping of different measures. In the absence of any spatial autocorrelation, a measure of 0 indicates a random geographical distribution. The association of nearby values around a particular geographic region is measured by univariate LISA [17]. It establishes the degree of spatial grouping and unpredictability that the data exhibit [18,19].
The following formula provides the measure I i : Four different types of autocorrelations were identified based on the Moran's scatter plots and are referred to as: • Districts with high measures and identical neighbouring districts are known as "hot spots" (High-High).
• Districts with low measures and identical neighbouring districts are known as "cold spots" (Low-Low).
• Districts that are high in measures but have low-measure neighbouring districts (High-Low) and districts that are low in measures but have high-measure neighbouring districts (Low-High) are known as "Spatial Outliers".
In the same way, bivariate LISA were also calculated to examine the relationship between the repetition rate and promotion rate of both boys and girls of regions with different dropout rates.
LISA cluster and significance map were produced in the Geo-Da by utilizing the LISA tools. The map shows the districts with significant Moran's I value categorized in terms of spatial autocorrelation, where hotspots are defined by red, coldspots by deep blue, and spatial outliers by light blue and light red. To investigate the possible relationships between the dropouts and predictors, we carried out statistical regressions. We first used the ordinary least square (OLS) model, then we calculated the spatial autocorrelation in the OLS regression residuals to check the spatial heterogeneity caused by spatial dependency. As soon as we determined that the Moran's I statistic for each of the outcomes was statistically significant, we calculated the spatial lag model (SLM) and spatial error model (SEM) to obtain unbiased measures of the correlations between the predictors and dropout while addressing the geographical heterogeneity that existed in the data. A standard SLM assumes that the data points are spatially dependent and lag to one another in the nearby regions, In contrast, the SEM assumes that the disturbance terms are correlated with nearby geographical units. The best model was then determined by comparing the Akaike Information Criterion (AIC) and Schwartz Criterion (BIC) values, and we observed that SEM provided the better fit for this particular study.
The fundamental formula for OLS is as follows: Where Y is the response, x denotes the predictors' vector, α is the model's intercept and β is the associated coefficient vector. The assumption is that the error component (ε) is identically and independently distributed (i.i.d). If it is found that there is a significant spatial dependency, simultaneity bias in the OLS likely result in erroneous and inaccurate estimations of the model indicators. However, SLM and SEM models are fitted as follows to limit the topographical effects. SLM, if the response variable (Y), is correlated to the weighted mean of the observations in its surrounding regions, where ρ is the auto-regressive parameter, then SEM, if residuals reveal spatial dependency, the subsequent model effectively manages the spatial effect.
Here, λ denotes the auto-regressive parameter; z is the i.i.d. disturbance term. By increasing the relevant likelihood functions, both SEM and SLM are estimated [13]. QGIS desktop 3.26.2 and GeoDa 1.20.0.8 software were used for the statistical analysis. Table 1 gives the nation's overall dropout rate, promotion rate, and repetition rate. Results show that the secondary dropout rate is the biggest concern than the upper primary and primary dropout rates in India. The summary of the predictors shows that 99 percent of students at the primary level, 97 percent of students at the upper primary level and 84 percent of students at the secondary level were promoted. In the case of the repetition rate, 0.51 percent of students at the primary level, 0.56 percent of students at the upper primary level and 1.85 percent of students at the secondary level repeated their grades in India.

Spatial pattern and clustering of school dropouts across India
The spatial distribution of districts' primary, upper primary and secondary dropout rates is depicted in Fig 1A-1C. The color highlights the geographical variations in dropout rates. The lighter color represents a lower dropout rate, whereas the darker color represents a greater dropout rate in certain districts. From the spatial maps, it is obvious that dropout rates vary geographically throughout the districts.
The estimated values of the Moran's I statistic shows primary, upper primary and secondary dropout rates across the districts have spatial autocorrelation. The LISA cluster map in Fig 2A  and the significance map in Fig 2B show 57 hotspots in primary dropout across the Indian districts. Similarly, Fig 2C shows 53 hotspots in upper primary dropout in Indian districts. In secondary dropout, 58 hotspots were identified and are shown in Fig 2E. The number of districts in each state which are hotspots is given in Table 2 and the S1 Appendix gives Moran's, I statistic, cluster number and p-value for all 726 districts.
The estimated values of bivariate LISA shown in Table 3 indicate the results of the spatial relationship between the primary, upper primary and secondary dropout rate with the other predictors taken in this study. Among the various predictors, the promotion rate of boys and girls consistently showed a negative, dropout rate of boys and girls showed a positive spatial autocorrelation at the 5% level of significance with primary, upper primary and secondary dropout across the districts. In the case of Upper primary dropout, the repetition rate of boys and girls showed positive spatial autocorrelation at the 5% level of significance. In the case of secondary dropout, boys' repetition rate showed a negative spatial autocorrelation at significance level (<0.05) and girls' repetition rate showed very low (insignificant) spatial autocorrelation. Table 4 presents findings from OLS, SLM, and SEM models that describe how factors affect different dropouts after controlling for topographical effects. Based on the model selection criteria, SEM was observed to be the better-fitted model for all primary, upper primary, and secondary dropouts.

Estimated outcomes from the OLS, SLM and SEM models
Primary dropout. The estimated coefficients of OLS regression were 0.103(<0.05) for boys' promotion rate, -0.113(<0.05) for girls' promotion rate, 0.604(<0.05) for boys' dropout rate, 0.418(<0.05) for girls dropout rate were statistically significant to primary dropout. After making spatial modifications using the spatial model, it was observed that the pattern of the relationship between predictors and primary dropouts remained the same. We found SEM (2415.43, 2447.55) as the best fit since it has the lowest AIC and BIC values when comparing SLM (2430.42, 2467.12) and OLS (2428.53, 2460.65).
Upper primary dropout. The estimated coefficients of OLS regression were -0.255(0.001) for boys' promotion rate, 0.245(0.001) for girls' promotion rate, -0.345(0.001) for boys' repetition rate, 0.327(0.001) for girls' repetition rate, 0.229(0.001) for boys dropout rate, 0.785(0.001) for girls dropout rate were statistically significant to upper primary dropout. After making spatial modifications using the spatial model, it was observed that the pattern of the relationship between predictors and upper primary dropouts remained the same. The corresponding AIC and BIC values of OLS (2405.81, 2437.92), SLM (2407.64, 2444.34) and SEM (2405.63, 2437.74) are obtained and SEM is considered to be the best fit.

PLOS ONE
Detection of hotspots of school dropouts in India

Discussion
School graduation denotes promotion from primary to secondary. Hence high promotion rate yields a collective benefit in aligning towards the retention universalization policy of UEE, propelling the whole education system processing towards India's vision for 2030. Our study, based on geospatial techniques, reveals that School dropouts are one of the essential criteria for increasing a country's overall school graduation percentage. Hence, dropping out is an issue that has to be solved [20]. Researchers have examined potential influences and identified significant effects of child, family, school, and community factors [21][22][23][24][25]. In addition to these specific concerns, the impact is even worse when child and family factors such as sex, caste, and religion combine with school-level factors such as attendance, pupil-teacher ratio, and school infrastructure [9,26,27]. An additional list of objects that are rarely researched is spatial factors.
The current study intends to apply the existing method for assessing the geographical neighborhood factors that influence dropout incidence and to evaluate significant findings on these factors in India. We emphasize that spatial statistics can give valuable information for analyzing dropout distribution and point out its importance.
With the use of the district-level information from the UDISE+ India of 2020, a LISA cluster map is generated. Based on the thematic maps, we observed that district dropout rates were not simply random. In particular, several districts in Arunachal Pradesh, Assam, Bihar, Gujarat, Jammu & Kashmir, Jharkhand, Madhya Pradesh, Manipur, Meghalaya, Mizoram, Nagaland, Odisha, Tamil Nadu, Tripura, and Uttar Pradesh are considered to be hotspots for dropouts in the primary, upper primary and secondary levels because these clusters of neighborhoods have a low rate of promotion and a high rate of repetition.
Further investigation on hotspots reveals a correlation between the dropouts and the promotion rate and repetition rate of boys and girls. Although previous studies have linked child and family factors to a higher risk of dropout [28], the current study emphasises that districts with high repetition rates and low promotion rates may also impact dropout rates. Districts in Arunachal Pradesh, Jammu & Kashmir, Nagaland, Assam, and Bihar have stood out, particularly in these hotspots. Even though dropout rates are greater in low promotion and high repetition rate districts, this does not imply that dropout rates are also higher in these regional clusters. There are several districts where the dropout and promotion rates are inversely proportional. In addition to the few co-locational clusters, low promotion rates, and high repetition rate districts also have outliers in terms of dropout rate. So, Researchers must be cautious when making conclusions. They must also qualitatively explore why certain nearby districts have noticeably different dropout rates despite having a nearly identical promotion and repetition rates. It is feasible that the dropout rate might be less due to the standard initiatives implemented between blocks at the district level [29].
Despite being focused on Indian districts, this study's methodology may also be employed at the block level. Understanding the spatial features and clusters of districts with greater block-level dropout rates is more required than merely identifying the districts with the highest dropout rates. In this regard, geospatial studies can assist in identifying block-level dropout hotspots to help make decisions that possess the capacity in changing district-level dropout trends.
In this study, we emphasized the importance of spatial analysis of district-level dropout rates using quartile, univariate-bivariate LISA, and spatial autoregressive models. It is possible to find the district-level variations of dropout rates using quartile maps. Clustered LISA maps displayed districts with comparable high or low dropout rates close to one another. This identification of dropout hotspots or coldspots can increase the effectiveness of contextualized interventions [30]. LISA maps highlight outliers, and finding outliers might encourage the qualitative researcher to investigate what would make a specific district differ from its neighboring districts. Additionally, spatial regressions were employed in this study to investigate the possible correlation between the dropout and promotion-repetition rates of boys and girls.
Although this study contributes some conceptual and methodological insights, it has certain limitations. Firstly, GIS-based research will always have difficulty with data availability, and this study also faced the same. Second, this study analyzed the pattern of district-level dropout rates and their influencing school-level characteristics across districts within a district-level framework. This analysis might be taken further and applied to block levels to determine the intra-district variation in student dropout rates, which could help to determine school-level factors influencing the dropout variation within districts. Third, though this study examined the regions in India with the highest dropout rates, our study only focused on spatial factors and not the causative framework. Fourth, we exclusively considered the characteristics at the school level specific to the promotion-repetition rates and excluded other factors related to child, family and community. This study is the first attempt to use spatial analysis for dropout data at the district level in India, therefore this limitation offers an opportunity for future work to expand the knowledge about how certain factors might be geographically clustered and examine regional patterns.
Despite its limitations, this study highlights the conceptual findings to advance our understanding of how dropout rates are influenced by geographical locations and methodological strategies for using GIS tools to investigate educational sector challenges. The inference that dropout rates in India are geographically grouped in specific districts in the north-eastern and central states is thematically supported by the evidence. These factors are not deterministic, although high dropout rates are correlated with low promotion rates and high repetition rates.
This study recommends and supports the use of GIS approaches that are suitable for concerns and challenges in education. Such techniques can enhance conceptual knowledge and practical implications by identifying spatially effective remedies. These conceptual and methodological contributions may motivate the researchers to explore the methodology for realworld problems including strategic planning, policy-making, and decision-making in the education sector.

Conclusion
Although school enrolment in India has seen a substantial increase over a few years, there is still no decrease in dropout rates. This study provides clear evidence that school dropout is a persisting issue in India affecting educational attainment progress. The results of our research indicate geospatial district-level variations in primary, upper primary, and secondary dropouts.
In addition to highlighting the geographical variations in dropout rates, this study examines the association between dropouts and promotion-repetition rates. In particular, it shows a strong negative relationship between promotion rates and school dropouts, and a positive relationship between the repetition rates and the dropouts. Based on the findings, the study recommends that India's education policy must target the hotspot districts with high dropouts rate.
Extending the study to analyze and correlate all causal factors in hot spot districts will throw light on creating navigable pathways for comprehensive intervention strategies. In addition, the maps and tables provided in our study will provide a substantial base for policymakers to initiate conversations based on highlighted hotspots. This kind of GIS-based study can assist the first step in developing a strategy to enhance the nation's educational infrastructure and, in turn, its economic situation.
Though the government has been consistently making efforts to improve the educational standards, the north-eastern and central districts in India need to be in a better position to provide standard education due to the high number of hotspots in this region. In the continuum to this conclusion, this study calls for rigorous preventive measures to reduce dropouts in achieving quality education. with the ultimate goal of achieving global education standards that will enable our future generations to strive and prosper more successfully on the planet.